Abstract

We propose a novel semantic segmentation algorithm by
learning a deconvolution network. We learn the network
on top of the convolutional layers adopted from VGG 16-layer
net. The deconvolution network is composed of deconvolution
and unpooling layers, which identify pixel-wise
class labels and predict segmentation masks. We apply the
trained network to each proposal in an input image, and
construct the final semantic segmentation map by combining
the results from all proposals in a simple manner. The
proposed algorithm mitigates the limitations of the existing
methods based on fully convolutional networks by integrating
deep deconvolution network and proposal-wise
prediction; our segmentation method typically identifies detailed
structures and handles objects in multiple scales naturally.
Our network demonstrates outstanding performance
in PASCAL VOC 2012 dataset, and we achieve the best accuracy
(72.5%) among the methods trained with no external
data through ensemble with the fully convolutional network.



(b) Missing labels due to small object size 

Figure 1. Limitations of semantic segmentation algorithms based 
on fully convolutional network. (Left) original image. (Center) 
ground-truth annotation. (Right) segmentations by [17] 


1. Introduction 

Convolutional neural networks (CNNs) have shown excellent
performance in various visual recognition problems
such as image classification [15, 22, 23], object detection
[7, 9], semantic segmentation [6, 18], and action recognition
[12, 21]. The representation power of CNNs leads
to successful results; a combination of feature descriptors
extracted from CNNs and simple off-the-shelf classifiers
works very well in practice. Encouraged by the success
in classification problems, researchers have started to apply CNNs
to structured prediction problems, e.g., semantic segmentation
[17, 1], human pose estimation [16], and so on.

Recent semantic segmentation algorithms are often formulated
to solve structured pixel-wise labeling problems
based on CNN [ , 17]. They convert an existing CNN architecture
constructed for classification to a fully convolutional
network (FCN). They obtain a coarse label map from
the network by classifying every local region in an image, and
perform a simple deconvolution, which is implemented as
bilinear interpolation, for pixel-level labeling. A conditional
random field (CRF) is optionally applied to the output map
for fine segmentation [14]. The main advantage of the methods
based on FCN is that the network accepts a whole image
as an input and performs fast and accurate inference.

Semantic segmentation methods based on FCNs [1, 17] have a
couple of critical limitations. First, the network can handle
semantics at only a single scale within an image due to the
fixed-size receptive field. Therefore, an object that is substantially
larger or smaller than the receptive field may be
fragmented or mislabeled. In other words, label prediction
is done with only local information for large objects and the
pixels that belong to the same object may have inconsistent
labels as shown in Figure 1(a). Also, small objects are often
ignored and classified as background, which is illustrated in
Figure 1(b). Although [17] attempts to sidestep this limitation
using skip architecture, this is not a fundamental solution
and performance gain is not significant. Second, the
detailed structures of an object are often lost or smoothed
because the label map, the input to the deconvolutional layer,
is too coarse and the deconvolution procedure is overly simple.
Note that, in the original FCN [17], the label map is
only 16 x 16 in size and is deconvolved to generate the segmentation
result in the original input size through bilinear
interpolation. The absence of real deconvolution in [1, 17]
makes it difficult to achieve good performance. However,
recent methods ameliorate this problem using CRF [14].

To overcome such limitations, we employ a completely 
different strategy to perform semantic segmentation based 
on CNN. Our main contributions are summarized below: 

• We learn a multi-layer deconvolution network, which
is composed of deconvolution, unpooling, and rectified
linear unit (ReLU) layers. Learning a deconvolution network
for semantic segmentation is meaningful but, to our
knowledge, no one has attempted it yet.

• The trained network is applied to individual object proposals
to obtain instance-wise segmentations, which
are combined for the final semantic segmentation; it
is free from scale issues found in FCN-based methods
and identifies finer details of an object.

• We achieve outstanding performance using the deconvolution
network trained only on PASCAL VOC 2012
dataset, and obtain the best accuracy through the ensemble
with [17] by exploiting the heterogeneous and
complementary characteristics of our algorithm with
respect to FCN-based methods.

We believe that all of these three contributions help achieve 
the state-of-the-art performance in PASCAL VOC 2012 
benchmark. 

The rest of this paper is organized as follows. We first
review related work in Section 2 and describe the architecture
of our network in Section 3. The detailed procedure
to learn a supervised deconvolution network is discussed
in Section 4. Section 5 presents how to utilize the learned
deconvolution network for semantic segmentation. Experimental
results are demonstrated in Section 6.

2. Related Work 

CNNs are very popular in many visual recognition problems
and have also been applied to semantic segmentation
actively. We first summarize the existing algorithms based
on supervised learning for semantic segmentation.

There are several semantic segmentation methods based
on classification. Mostajabi et al. [18] and Farabet et al. [6]
classify multi-scale superpixels into predefined categories
and combine the classification results for pixel-wise labeling.
Some algorithms [3, 9, 10] classify region proposals
and refine the labels in the image-level segmentation map
to obtain the final segmentation.


Fully convolutional network (FCN) [17] has driven the recent
breakthrough in deep learning based semantic segmentation.
In this approach, fully connected layers in the
standard CNNs are interpreted as convolutions with large
receptive fields, and segmentation is achieved using coarse
class score maps obtained by feedforwarding an input image.
An interesting idea in this work is that a simple interpolation
filter is employed for deconvolution and only the
CNN part of the network is fine-tuned to learn deconvolution
indirectly. Surprisingly, the resulting network shows
impressive performance on the PASCAL VOC benchmark.
Chen et al. [1] obtain denser score maps within the FCN
framework to predict pixel-wise labels and refine the label
map using the fully connected CRF [14].

In addition to the methods based on supervised learning,
several semantic segmentation techniques in weakly supervised
settings have been proposed. When only bounding
box annotations are given for input images, [2, 19] refine
the annotations through iterative procedures and obtain accurate
segmentation outputs. On the other hand, [20] performs
semantic segmentation based only on image-level annotations
in a multiple instance learning framework.

Semantic segmentation involves deconvolution conceptually,
but learning a deconvolution network is not very common.
The deconvolution network is introduced in [25] to reconstruct
input images. As the reconstruction of an input
image is non-trivial due to max pooling layers, it proposes
the unpooling operation by storing the pooled location. Using
the deconvolution network, the input image can be reconstructed
from its feature representation. This approach
is also employed to visualize activated features in a trained
CNN [24] and update network architecture for performance
enhancement. This visualization is useful for understanding
the behavior of a trained CNN model.

3. System Architecture 

This section discusses the architecture of our deconvolution
network, and describes the overall semantic segmentation
algorithm.

3.1. Architecture 

Figure 2 illustrates the detailed configuration of the entire
deep network. Our trained network is composed of two
parts: convolution and deconvolution networks. The convolution
network corresponds to a feature extractor that transforms
the input image to a multidimensional feature representation,
whereas the deconvolution network is a shape generator
that produces object segmentation from the feature
extracted from the convolution network. The final output of
the network is a probability map of the same size as the input
image, indicating the probability that each pixel belongs to
one of the predefined classes.



Figure 2. Overall architecture of the proposed network. On top of the convolution network based on VGG 16-layer net, we put a
multi-layer deconvolution network to generate the accurate segmentation map of an input proposal. Given a feature representation obtained from
the convolution network, dense pixel-wise class prediction map is constructed through multiple series of unpooling, deconvolution and
rectification operations.


We employ VGG 16-layer net [22] for the convolutional part
with its last classification layer removed. Our convolution
network has 13 convolutional layers altogether, rectification
and pooling operations are sometimes performed between
convolutions, and 2 fully connected layers are augmented
at the end to impose class-specific projection. Our
deconvolution network is a mirrored version of the convolution
network, and has multiple series of unpooling, deconvolution,
and rectification layers. Contrary to the convolution
network that reduces the size of activations through feedforwarding,
the deconvolution network enlarges the activations
through the combination of unpooling and deconvolution
operations. More details of the proposed deconvolution network
are described in the following subsections.

3.2. Deconvolution Network for Segmentation 

We now discuss two main operations, unpooling and deconvolution,
in our deconvolution network in detail.

3.2.1 Unpooling 

Pooling in convolution network is designed to filter noisy
activations in a lower layer by abstracting activations in a
receptive field with a single representative value. Although
it helps classification by retaining only robust activations in
upper layers, spatial information within a receptive field is
lost during pooling, which may be critical for precise localization
that is required for semantic segmentation.

To resolve this issue, we employ unpooling layers in the deconvolution
network, which perform the reverse operation
of pooling and reconstruct the original size of activations as
illustrated in Figure 3. To implement the unpooling operation,
we follow an approach similar to the one proposed in [24, 25]. It
records the locations of maximum activations selected during
the pooling operation in switch variables, which are employed
to place each activation back to its original pooled
location. This unpooling strategy is particularly useful to
reconstruct the structure of an input object as described in [24].
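
The switch mechanism can be illustrated with a minimal sketch in plain Python. It assumes a single-channel map whose height and width are divisible by the pooling size; the function names `max_pool_with_switches` and `unpool` are hypothetical and are not taken from the implementation used in this work.

```python
def max_pool_with_switches(fmap, k=2):
    """k x k max pooling with stride k on a 2-D list; also returns the switches.

    switches[i][j] holds the (row, col) of the maximum inside pooling window (i, j).
    """
    rows, cols = len(fmap), len(fmap[0])
    pooled, switches = [], []
    for r0 in range(0, rows, k):
        pooled_row, switch_row = [], []
        for c0 in range(0, cols, k):
            window = [(fmap[r][c], (r, c))
                      for r in range(r0, r0 + k)
                      for c in range(c0, c0 + k)]
            value, location = max(window, key=lambda item: item[0])
            pooled_row.append(value)
            switch_row.append(location)
        pooled.append(pooled_row)
        switches.append(switch_row)
    return pooled, switches


def unpool(pooled, switches, out_shape):
    """Place each pooled activation back at its recorded location; zeros elsewhere."""
    rows, cols = out_shape
    out = [[0.0] * cols for _ in range(rows)]
    for pooled_row, switch_row in zip(pooled, switches):
        for value, (r, c) in zip(pooled_row, switch_row):
            out[r][c] = value
    return out


fmap = [[1.0, 3.0, 2.0, 0.0],
        [4.0, 2.0, 1.0, 5.0],
        [0.0, 1.0, 7.0, 2.0],
        [6.0, 2.0, 3.0, 1.0]]
pooled, switches = max_pool_with_switches(fmap)
restored = unpool(pooled, switches, (4, 4))
```

The restored map has the original size but is sparse: only the positions that won the pooling competition carry a value, which is why a densifying step is needed afterwards.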



Convolution Deconvolution 


Figure 3. Illustration of deconvolution and unpooling operations. 

3.2.2 Deconvolution 

The output of an unpooling layer is an enlarged, yet sparse
activation map. The deconvolution layers densify the sparse
activations obtained by unpooling through convolution-like
operations with multiple learned filters. However, contrary
to convolutional layers, which connect multiple input activations
within a filter window to a single activation, deconvolutional
layers associate a single input activation with
multiple outputs, as illustrated in Figure 3. The output of
the deconvolutional layer is an enlarged and dense activation
map. We crop the boundary of the enlarged activation
map to keep the size of the output map identical to the one
from the preceding unpooling layer.
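
The one-to-many association and the boundary crop can be sketched as follows. This is a single-channel, stride-1 illustration with a fixed, hand-written kernel standing in for a learned filter; `deconv2d` is a hypothetical name, and a square kernel of odd size is assumed.

```python
def deconv2d(sparse, kernel):
    """Single-channel transposed convolution (stride 1), cropped to the input size.

    Each input activation scatters a scaled copy of the kernel into the output,
    so one input value contributes to several output positions.
    """
    rows, cols = len(sparse), len(sparse[0])
    k = len(kernel)          # square kernel with odd size assumed
    pad = k // 2
    full = [[0.0] * (cols + 2 * pad) for _ in range(rows + 2 * pad)]
    for r in range(rows):
        for c in range(cols):
            activation = sparse[r][c]
            if activation == 0.0:
                continue
            for i in range(k):
                for j in range(k):
                    full[r + i][c + j] += activation * kernel[i][j]
    # Crop the enlarged boundary so the output matches the unpooled map size.
    return [row[pad:pad + cols] for row in full[pad:pad + rows]]


sparse = [[0.0] * 4 for _ in range(4)]
sparse[1][1] = 2.0                       # one surviving activation after unpooling
kernel = [[1.0, 2.0, 1.0],
          [2.0, 4.0, 2.0],
          [1.0, 2.0, 1.0]]               # illustrative filter, not a learned one
dense = deconv2d(sparse, kernel)
```

A single non-zero input spreads over a 3 x 3 neighbourhood of the output, which is the densifying behaviour described above.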

The learned filters in deconvolutional layers correspond
to bases to reconstruct the shape of an input object. Therefore,
similar to the convolution network, a hierarchical structure
of deconvolutional layers is used to capture different levels
of shape details. The filters in lower layers tend to capture
the overall shape of an object while the class-specific fine
details are encoded in the filters in higher layers. In this
way, the network directly takes class-specific shape information
into account for semantic segmentation, which is
often ignored in other approaches based only on convolutional
layers [1, 17].

Figure 4. Visualization of activations in our deconvolution network. The activation maps from top left to bottom right correspond to the
output maps from lower to higher layers in the deconvolution network. We select the most representative activation in each layer for
effective visualization. The image in (a) is an input, and the rest are the outputs from (b) the last 14 x 14 deconvolutional layer, (c) the
28 x 28 unpooling layer, (d) the last 28 x 28 deconvolutional layer, (e) the 56 x 56 unpooling layer, (f) the last 56 x 56 deconvolutional
layer, (g) the 112 x 112 unpooling layer, (h) the last 112 x 112 deconvolutional layer, (i) the 224 x 224 unpooling layer and (j) the last
224 x 224 deconvolutional layer. The finer details of the object are revealed, as the features are forward-propagated through the layers in
the deconvolution network. Note that noisy activations from background are suppressed through propagation while the activations closely
related to the target classes are amplified. It shows that the learned filters in higher deconvolutional layers tend to capture class-specific
shape information.

3.2.3 Analysis of Deconvolution Network 

In the proposed algorithm, the deconvolution network is a
key component for precise object segmentation. Contrary
to the simple deconvolution in [17] performed on coarse activation
maps, our algorithm generates object segmentation
masks using deep deconvolution network, where a dense
pixel-wise class probability map is obtained by successive
operations of unpooling, deconvolution, and rectification.

Figure 4 visualizes the outputs from the network layer by
layer, which is helpful to understand internal operations of
our deconvolution network. We can observe that coarse-to-fine
object structures are reconstructed through the propagation
in the deconvolutional layers; lower layers tend to capture
overall coarse configuration of an object (e.g. location,
shape and region), while more complex patterns are discovered
in higher layers. Note that unpooling and deconvolution
play different roles for the construction of segmentation
masks. Unpooling captures example-specific structures by
tracing the original locations with strong activations back
to image space. As a result, it effectively reconstructs the
detailed structure of an object in finer resolutions. On the
other hand, learned filters in deconvolutional layers tend to
capture class-specific shapes. Through deconvolutions, the
activations closely related to the target classes are amplified
while noisy activations from other regions are suppressed
effectively. By the combination of unpooling and deconvolution,
our network generates accurate segmentation maps.

Figure 5 illustrates examples of outputs from FCN-8s
and the proposed network. Compared to the coarse activation
map of FCN-8s, our network constructs dense and
precise activations using the deconvolution network.

3.3. System Overview 

Our algorithm poses semantic segmentation as an instance-wise
segmentation problem. That is, the network takes
a sub-image potentially containing objects (which we refer
to as instance(s) afterwards) as an input and produces
pixel-wise class prediction as an output. Given our network,
semantic segmentation on a whole image is obtained by applying
the network to each candidate proposal extracted
from the image and aggregating outputs of all proposals to
the original image space.

Instance-wise segmentation has a few advantages over
image-level prediction. It handles objects in various scales
effectively and identifies fine details of objects while the approaches
with fixed-size receptive fields have trouble with
these issues. Also, it alleviates training complexity by reducing
search space for prediction and reduces memory requirement
for training.

(a) Input image (b) FCN-8s (c) Ours

Figure 5. Comparison of class conditional probability maps from
FCN and our network (top: dog, bottom: bicycle).


4. Training

The entire network described in the previous section is
very deep (twice as deep as [22]) and contains a lot of associated
parameters. In addition, the number of training examples
for semantic segmentation is relatively small compared
to the size of the network: 12031 PASCAL training
and validation images in total. Training a deep network with
a limited number of examples is not trivial and we train the
network successfully using the following ideas.

4.1. Batch Normalization

It is well-known that a deep neural network is very hard
to optimize due to the internal-covariate-shift problem [11];
input distributions in each layer change over iterations during
training as the parameters of its previous layers are updated.
This is problematic in optimizing very deep networks since
the changes in distribution are amplified through propagation
across layers.

We perform the batch normalization [11] to reduce the
internal-covariate-shift by normalizing input distributions
of every layer to the standard Gaussian distribution. For
this purpose, a batch normalization layer is added to the output
of every convolutional and deconvolutional layer. We
observe that the batch normalization is critical to optimize
our network; it ends up with a poor local optimum without
batch normalization.

4.2. Two-stage Training

Although batch normalization helps escape local optima,
the space of semantic segmentation is still very large compared
to the number of training examples and the benefit
of using a deconvolution network for instance-wise segmentation
would be cancelled. We therefore employ a two-stage
training method to address this issue, where we train the
network with easy examples first and fine-tune the trained
network with more challenging examples later.

To construct training examples for the first stage training,
we crop object instances using ground-truth annotations so
that an object is centered at the cropped bounding box. By
limiting the variations in object location and size, we reduce
search space for semantic segmentation significantly
and train the network successfully with far fewer training
examples. In the second stage, we utilize object proposals
to construct more challenging examples. Specifically, candidate
proposals sufficiently overlapped with ground-truth
segmentations are selected for training. Using the proposals
to construct training data makes the network more robust to
the misalignment of proposals in testing, but makes training
more challenging since the location and scale of an object
may be significantly different across training examples.

5. Inference

The proposed network is trained to perform semantic
segmentation for individual instances. Given an input image,
we first generate a sufficient number of candidate proposals,
and apply the trained network to obtain semantic
segmentation maps of individual proposals. Then we aggregate
the outputs of all proposals to produce semantic
segmentation on a whole image. Optionally, we take an ensemble
of our method with FCN [17] to further improve
performance. We describe the detailed procedure in the following.

5.1. Aggregating Instance-wise Segmentation Maps 

Since some proposals may result in incorrect predictions
due to misalignment to object or cluttered background, we
should suppress such noises during aggregation. The pixel-wise
maximum or average of the score maps corresponding
to all classes turns out to be sufficiently effective to obtain robust
results.

Let $g_i \in \mathbb{R}^{W \times H \times C}$ be the output score maps of the $i$th
proposal, where $W \times H$ and $C$ denote the size of the proposal
and the number of classes, respectively. We first put it on
image space with zero padding outside $g_i$; we denote the
segmentation map corresponding to $g_i$ in the original image
size by $G_i$ hereafter. Then we construct the pixel-wise class
score map of an image by aggregating the outputs of all
proposals by

$$P(x, y, c) = \max_i G_i(x, y, c), \quad \forall i, \tag{1}$$

or

$$P(x, y, c) = \sum_i G_i(x, y, c), \quad \forall i. \tag{2}$$







Class conditional probability maps in the original image
space are obtained by applying the softmax function to the aggregated
maps obtained by Eq. (1) or (2). Finally, we apply
the fully-connected CRF [14] to the output maps for the final
pixel-wise labeling, where unary potentials are obtained
from the pixel-wise class conditional probability maps.
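
The aggregation step can be sketched in plain Python as below. This is an illustration under stated assumptions, not the implementation used in the experiments: score maps are nested lists indexed as `scores[y][x][c]`, the helper names (`place_on_image`, `aggregate`, `softmax`) and the toy proposals are hypothetical, the `"mean"` mode implements the pixel-wise average mentioned in the text, and the CRF step is omitted.

```python
import math


def place_on_image(scores, x0, y0, height, width):
    """Zero-pad a proposal score map g_i to image size, giving G_i.

    scores[y][x] is a list of per-class scores; (x0, y0) is the top-left corner.
    """
    num_classes = len(scores[0][0])
    padded = [[[0.0] * num_classes for _ in range(width)] for _ in range(height)]
    for dy, row in enumerate(scores):
        for dx, pixel in enumerate(row):
            padded[y0 + dy][x0 + dx] = list(pixel)
    return padded


def aggregate(padded_maps, mode="max"):
    """Combine all G_i pixel-wise with a maximum ("max") or an average ("mean")."""
    reduce_fn = max if mode == "max" else (lambda values: sum(values) / len(values))
    height, width = len(padded_maps[0]), len(padded_maps[0][0])
    num_classes = len(padded_maps[0][0][0])
    return [[[reduce_fn([g[y][x][c] for g in padded_maps])
              for c in range(num_classes)]
             for x in range(width)]
            for y in range(height)]


def softmax(scores):
    """Numerically stable softmax over the class scores of one pixel."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Two hypothetical 2 x 2 proposals on a 2 x 3 image with two classes.
g_a = [[[1.0, 3.0]] * 2] * 2
g_b = [[[2.0, 0.5]] * 2] * 2
maps = [place_on_image(g_a, 0, 0, 2, 3), place_on_image(g_b, 1, 0, 2, 3)]
score_max = aggregate(maps, "max")
score_mean = aggregate(maps, "mean")
probabilities = [[softmax(pixel) for pixel in row] for row in score_max]
```

In the middle column, where the two proposals overlap, the maximum keeps the strongest response per class, so a confident proposal is not diluted by a weaker one.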

5.2. Ensemble with FCN 

Our algorithm based on the deconvolution network has
complementary characteristics to the approaches relying on
FCN; our deconvolution network is appropriate to capture
the fine details of an object, whereas FCN is typically good
at extracting the overall shape of an object. In addition,
instance-wise prediction is useful for handling objects with
various scales, while fully convolutional network with a
coarse scale may be advantageous to capture context within
an image. Exploiting these heterogeneous properties may lead
to better results, and we take advantage of the benefits of
both algorithms through ensemble.

We develop a simple method to combine the outputs of
both algorithms. Given two sets of class conditional probability
maps of an input image computed independently by
the proposed method and FCN, we compute the mean of
both output maps and apply the CRF to obtain the final semantic
segmentation.
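
A minimal sketch of the averaging step is given below. The probability values are made up for illustration, the function names are hypothetical, and the CRF is replaced by a plain per-pixel argmax, which is a simplification rather than the procedure described above.

```python
def ensemble(prob_a, prob_b):
    """Pixel-wise mean of two class conditional probability maps of equal shape."""
    return [[[0.5 * (pa + pb) for pa, pb in zip(pix_a, pix_b)]
             for pix_a, pix_b in zip(row_a, row_b)]
            for row_a, row_b in zip(prob_a, prob_b)]


def argmax_labels(prob):
    """Most probable class per pixel (stand-in for the CRF labeling step)."""
    return [[max(range(len(pixel)), key=pixel.__getitem__) for pixel in row]
            for row in prob]


# Hypothetical 1 x 2 probability maps with two classes.
deconvnet_prob = [[[0.6, 0.4], [0.2, 0.8]]]
fcn_prob = [[[0.3, 0.7], [0.4, 0.6]]]
mean_prob = ensemble(deconvnet_prob, fcn_prob)
labels = argmax_labels(mean_prob)
```

For the first pixel the two models disagree, and the averaged map follows the more confident one.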

6. Experiments 

This section first describes our implementation details 
and experiment setup. Then, we analyze and evaluate the 
proposed network in various aspects. 

6.1. Implementation Details 

Network Configuration Table 2 summarizes the detailed
configuration of the proposed network presented in Figure 2.
Our network has symmetrical configuration of convolution
and deconvolution network centered around the 2nd
fully-connected layer (fc7). The input and output layers
correspond to input image and class conditional probability
maps, respectively. The network contains approximately
252M parameters in total.
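
As a sanity check on the parameter count, the arithmetic can be reproduced from the layer configuration in Table 2. The sketch below assumes one bias per output channel and ignores batch normalization parameters; both are assumptions of this illustration, and the channel counts are read off the output sizes in the table.

```python
# (name, kernel size, input channels, output channels), read off Table 2.
CONV_LAYERS = [
    ("conv1-1", 3, 3, 64), ("conv1-2", 3, 64, 64),
    ("conv2-1", 3, 64, 128), ("conv2-2", 3, 128, 128),
    ("conv3-1", 3, 128, 256), ("conv3-2", 3, 256, 256), ("conv3-3", 3, 256, 256),
    ("conv4-1", 3, 256, 512), ("conv4-2", 3, 512, 512), ("conv4-3", 3, 512, 512),
    ("conv5-1", 3, 512, 512), ("conv5-2", 3, 512, 512), ("conv5-3", 3, 512, 512),
    ("fc6", 7, 512, 4096), ("fc7", 1, 4096, 4096),
]
DECONV_LAYERS = [
    ("deconv-fc6", 7, 4096, 512),
    ("deconv5-1", 3, 512, 512), ("deconv5-2", 3, 512, 512), ("deconv5-3", 3, 512, 512),
    ("deconv4-1", 3, 512, 512), ("deconv4-2", 3, 512, 512), ("deconv4-3", 3, 512, 256),
    ("deconv3-1", 3, 256, 256), ("deconv3-2", 3, 256, 256), ("deconv3-3", 3, 256, 128),
    ("deconv2-1", 3, 128, 128), ("deconv2-2", 3, 128, 64),
    ("deconv1-1", 3, 64, 64), ("deconv1-2", 3, 64, 64),
    ("output", 1, 64, 21),
]


def count_parameters(layers):
    """Weights (k * k * c_in * c_out) plus one bias per output channel."""
    return sum(k * k * c_in * c_out + c_out for _, k, c_in, c_out in layers)


conv_total = count_parameters(CONV_LAYERS)
deconv_total = count_parameters(DECONV_LAYERS)
total = conv_total + deconv_total
print(f"convolution network:   {conv_total / 1e6:.1f}M")
print(f"deconvolution network: {deconv_total / 1e6:.1f}M")
print(f"total:                 {total / 1e6:.1f}M")
```

Under these assumptions the total lands close to the 252M quoted above, with the two 7 x 7 layers (fc6 and deconv-fc6) accounting for most of it.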

Dataset We employ PASCAL VOC 2012 segmentation
dataset [5] for training and testing the proposed deep network.
For training, we use augmented segmentation annotations
from [8], where all training and validation images are
used to train our network. The performance of our network
is evaluated on test images. Note that only the images in
PASCAL VOC 2012 datasets are used for training in our experiment,
whereas some state-of-the-art algorithms [2, 19]
employ additional data to improve performance.


Training Data Construction We employ a two-stage
training strategy and use a separate training dataset in each
stage. To construct training examples for the first stage,
we draw a tight bounding box corresponding to each annotated
object in training images, and extend the box 1.2 times
larger to include local context around the object. Then we
crop the window using the extended bounding box to obtain
a training example. The class label for each cropped region
is provided based only on the object located at the center
while all other pixels are labeled as background. In the second
stage, each training example is extracted from object
proposal [26], where all relevant class labels are used for
annotation. We employ the same post-processing as the one
used in the first stage to include context. For both datasets,
we maintain the balance for the number of examples across
classes by adding redundant examples for the classes with
limited number of examples. To augment training data, we
transform an input example to a 250 x 250 image and randomly
crop the image to 224 x 224 with optional horizontal
flipping in a similar way to [22]. The number of training examples
is 0.2M and 2.7M in the first and the second stage,
respectively, which is sufficiently large to train the deconvolution
network from scratch.
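
The box extension and the crop augmentation can be sketched as follows. Clipping the extended box to the image boundary, the 50% flip probability, and the function names are assumptions of this illustration; the text does not specify them.

```python
import random


def extend_box(x0, y0, x1, y1, img_w, img_h, scale=1.2):
    """Enlarge a tight box around its center by `scale`, clipped to the image."""
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    half_w = scale * (x1 - x0) / 2.0
    half_h = scale * (y1 - y0) / 2.0
    return (max(0.0, cx - half_w), max(0.0, cy - half_h),
            min(float(img_w), cx + half_w), min(float(img_h), cy + half_h))


def random_crop_params(rng, src=250, dst=224):
    """Top-left corner of a dst x dst crop inside a src x src image, plus a flip flag."""
    left = rng.randint(0, src - dst)   # randint is inclusive on both ends
    top = rng.randint(0, src - dst)
    flip = rng.random() < 0.5          # assumed flip probability
    return left, top, flip


box = extend_box(100, 100, 200, 300, img_w=500, img_h=500)
left, top, flip = random_crop_params(random.Random(0))
```

For the 100 x 200 box above, the extended window measures 120 x 240 and stays centered on the object.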

Optimization We implement the proposed network based
on Caffe [13] framework. The standard stochastic gradient
descent with momentum is employed for optimization,
where initial learning rate, momentum and weight decay are
set to 0.01, 0.9 and 0.0005, respectively. We initialize the
weights in the convolution network using VGG 16-layer net
pre-trained on ILSVRC [4] dataset, while the weights in the
deconvolution network are initialized with zero-mean Gaussians.
We remove the drop-out layers due to batch normalization,
and reduce learning rate by an order of magnitude
whenever validation accuracy does not improve. Although
our final network is learned with both train and validation
datasets, learning rate adjustment based on validation accuracy
still works well according to our experience. The
network converges after approximately 20K and 40K SGD
iterations with mini-batch of 64 samples in the first and second
stage training, respectively. Training takes 6 days (2
days for the first stage and 4 days for the second stage) on a
single Nvidia GTX Titan X GPU with 12G memory.
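
The learning rate rule can be sketched as a small function. How often validation accuracy is checked is not stated in the text, so the sequence of checks below is hypothetical, as are the accuracy values and the function name.

```python
def learning_rate_schedule(val_accuracies, initial_lr=0.01, factor=0.1):
    """Learning rate in effect after each validation check.

    The rate is divided by 10 whenever accuracy fails to beat the best seen so far.
    """
    lr, best, rates = initial_lr, float("-inf"), []
    for accuracy in val_accuracies:
        if accuracy > best:
            best = accuracy
        else:
            lr *= factor
        rates.append(lr)
    return rates


# Hypothetical validation accuracies at successive checks.
rates = learning_rate_schedule([0.60, 0.65, 0.65, 0.70, 0.69])
```

With these made-up accuracies the rate drops at the third and fifth checks, the two points where validation accuracy did not improve.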

Inference We employ edge-box [26] to generate object
proposals. For each testing image, we generate approximately
2000 object proposals, and select top 50 proposals
based on their objectness scores. We observe that this number
is sufficient to obtain accurate segmentation in practice.
To obtain pixel-wise class conditional probability maps for
a whole image, we compute pixel-wise maximum to aggregate
proposal-wise predictions as in Eq. (1).
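
Selecting the highest-scoring proposals amounts to a top-k query. The sketch below uses synthetic scores and boxes; the tuple layout `(objectness, box)` and the function name are assumptions of this illustration and do not reflect the edge-box output format.

```python
import heapq


def select_top_proposals(proposals, k=50):
    """proposals: iterable of (objectness, (x0, y0, x1, y1)); keep the k best."""
    return heapq.nlargest(k, proposals, key=lambda proposal: proposal[0])


# 2000 synthetic proposals with distinct objectness scores in [0, 1).
proposals = [(((i * 37) % 2000) / 2000.0, (0, 0, i + 1, i + 1)) for i in range(2000)]
top = select_top_proposals(proposals)
```

Only these 50 proposals are passed through the network, which keeps the per-image inference cost bounded.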


Table 1. Evaluation results on PASCAL VOC 2012 test set. (Asterisk (*) denotes the algorithms trained with additional data.)

Figure 6. Benefit of instance-wise prediction. We aggregate the proposals in a decreasing order of their sizes. The algorithm identifies
finer object structures through iterations by handling multi-scale objects effectively.


6.2. Evaluation on Pascal VOC 

We evaluate our network on PASCAL VOC 2012 benchmark
[5], which contains 1456 test images and involves 20
object categories. We adopt comp6 evaluation protocol that
measures scores based on Intersection over Union (IoU) between
ground truth and predicted segmentations.
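
The IoU measure can be sketched for a single pair of label maps as below. This is a simplified illustration: it does not handle ignored (void) pixels, and it averages over the classes present in one image rather than reproducing the benchmark's evaluation procedure; `mean_iou` is a hypothetical helper.

```python
def mean_iou(ground_truth, prediction, num_classes):
    """Mean over classes of |intersection| / |union| for two 2-D label maps."""
    intersection = [0] * num_classes
    union = [0] * num_classes
    for gt_row, pred_row in zip(ground_truth, prediction):
        for gt, pred in zip(gt_row, pred_row):
            if gt == pred:
                intersection[gt] += 1
                union[gt] += 1
            else:
                # A mismatched pixel enlarges the union of both classes involved.
                union[gt] += 1
                union[pred] += 1
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious)


ground_truth = [[0, 0, 1],
                [0, 1, 1]]
prediction = [[0, 1, 1],
              [0, 1, 0]]
score = mean_iou(ground_truth, prediction, num_classes=2)
```

In this toy case each class has two correctly labeled pixels and a union of four, so both per-class IoUs are 0.5.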

The quantitative results of the proposed algorithm and
the competitors are presented in Table 1¹, where our method
is denoted by DeconvNet. The performance of DeconvNet
is competitive with the state-of-the-art methods. The CRF [14]
as post-processing enhances accuracy by approximately 1%
point. We further improve performance through an ensemble
with FCN-8s. It improves mean IoU by about 10.3% and
3.1% point with respect to FCN-8s and our DeconvNet, respectively,
which is notable considering the relatively low accuracy
of FCN-8s. We believe that this is because our method
and FCN have complementary characteristics as discussed
in Section 5.2; this property differentiates our algorithm
from the existing ones based on FCN. Our ensemble method
with FCN-8s, denoted by EDeconvNet, achieves the best accuracy
among methods trained only on PASCAL VOC data.

Figure 6 demonstrates the effectiveness of instance-wise
prediction for accurate segmentation. We aggregate the proposals
in a decreasing order of their sizes and observe the
progress of segmentation. As the number of aggregated proposals
increases, the algorithm identifies finer object structures,
which are typically captured by small proposals.

¹ All numbers in this table are from the officially published papers, not
from the leaderboard, including the ones in arXiv.

The qualitative results of DeconvNet, FCN and their ensemble
are presented in Figure 7. Overall, DeconvNet produces
fine segmentations compared to FCN, and handles
multi-scale objects effectively through instance-wise prediction.
FCN tends to fail in labeling too large or small objects
(Figure 7(a)) due to its fixed-size receptive field. Our
network sometimes returns noisy predictions (Figure 7(b)),
when the proposals are misaligned or located at background
regions. The ensemble with FCN-8s produces much better
results as observed in Figure 7(a) and 7(b). Note that
inaccurate predictions from both FCN and DeconvNet are
sometimes corrected by ensemble as shown in Figure 7(c).
Adding CRF to ensemble improves quantitative performance,
although the improvement is not significant.

7. Conclusion 

We proposed a novel semantic segmentation algorithm
by learning a deconvolution network. The proposed deconvolution
network is suitable to generate dense and precise
object segmentation masks since coarse-to-fine structures
of an object are reconstructed progressively through
a sequence of deconvolution operations. Our algorithm
based on instance-wise prediction is advantageous to handle
object scale variations by eliminating the limitation
of fixed-size receptive field in the fully convolutional network.
We further proposed an ensemble approach, which
combines the outputs of the proposed algorithm and FCN-based
method, and achieved substantially better performance
thanks to complementary characteristics of both algorithms.
Our network demonstrated the state-of-the-art
performance in PASCAL VOC 2012 segmentation benchmark
among the methods trained with no external data.

Columns: Input image, Ground-truth, FCN, DeconvNet, EDeconvNet, EDeconvNet+CRF

(a) Examples where our method produces better results than FCN [17].

(b) Examples where FCN produces better results than our method.

(c) Examples where inaccurate predictions from our method and FCN are improved by ensemble.

Figure 7. Example of semantic segmentation results on PASCAL VOC 2012 validation images. Note that the proposed method and FCN
have complementary characteristics for semantic segmentation, and the combination of both methods improves accuracy through ensemble.
Although CRF removes some noises, it does not improve quantitative performance of our algorithm significantly.

Table 2. Detailed configuration of the proposed network. “conv”
and “deconv” denote layers in convolution and deconvolution network,
respectively, while numbers next to each layer name mean
the order of the corresponding layer in the network. ReLU layers
are omitted from the table for brevity.

name         kernel size   stride   pad   output size
input        -             -        -     224 x 224 x 3
conv1-1      3 x 3         1        1     224 x 224 x 64
conv1-2      3 x 3         1        1     224 x 224 x 64
pool1        2 x 2         2        0     112 x 112 x 64
conv2-1      3 x 3         1        1     112 x 112 x 128
conv2-2      3 x 3         1        1     112 x 112 x 128
pool2        2 x 2         2        0     56 x 56 x 128
conv3-1      3 x 3         1        1     56 x 56 x 256
conv3-2      3 x 3         1        1     56 x 56 x 256
conv3-3      3 x 3         1        1     56 x 56 x 256
pool3        2 x 2         2        0     28 x 28 x 256
conv4-1      3 x 3         1        1     28 x 28 x 512
conv4-2      3 x 3         1        1     28 x 28 x 512
conv4-3      3 x 3         1        1     28 x 28 x 512
pool4        2 x 2         2        0     14 x 14 x 512
conv5-1      3 x 3         1        1     14 x 14 x 512
conv5-2      3 x 3         1        1     14 x 14 x 512
conv5-3      3 x 3         1        1     14 x 14 x 512
pool5        2 x 2         2        0     7 x 7 x 512
fc6          7 x 7         1        0     1 x 1 x 4096
fc7          1 x 1         1        0     1 x 1 x 4096
deconv-fc6   7 x 7         1        0     7 x 7 x 512
unpool5      2 x 2         2        0     14 x 14 x 512
deconv5-1    3 x 3         1        1     14 x 14 x 512
deconv5-2    3 x 3         1        1     14 x 14 x 512
deconv5-3    3 x 3         1        1     14 x 14 x 512
unpool4      2 x 2         2        0     28 x 28 x 512
deconv4-1    3 x 3         1        1     28 x 28 x 512
deconv4-2    3 x 3         1        1     28 x 28 x 512
deconv4-3    3 x 3         1        1     28 x 28 x 256
unpool3      2 x 2         2        0     56 x 56 x 256
deconv3-1    3 x 3         1        1     56 x 56 x 256
deconv3-2    3 x 3         1        1     56 x 56 x 256
deconv3-3    3 x 3         1        1     56 x 56 x 128
unpool2      2 x 2         2        0     112 x 112 x 128
deconv2-1    3 x 3         1        1     112 x 112 x 128
deconv2-2    3 x 3         1        1     112 x 112 x 64
unpool1      2 x 2         2        0     224 x 224 x 64
deconv1-1    3 x 3         1        1     224 x 224 x 64
deconv1-2    3 x 3         1        1     224 x 224 x 64
output       1 x 1         1        1     224 x 224 x 21